## [1] 1599 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
All variables are numeric or integer. The input variables describe the chemical composition of the wines. The output variable, quality, is based on sensory data and is scored between 0 and 10; in this dataset it takes values from 3 to 8. The other variables are dispersed to different degrees and are recorded to different precision, from whole numbers to several decimal places.
Quality scores range from 3 to 8, with the majority in the 5-7 range. Only six distinct scores occur, so it makes sense to create a categorical variable from the scores for further analysis.
The citric acid distribution looks multimodal, with peaks at 0, 0.25 and 0.5. It would make sense to test how this relates to other variables, including ‘quality’. There are some outliers.
pH values are approximately bell-shaped. Most fall in the 3.0-3.75 range.
‘volatile.acidity’ may have a bimodal distribution. It may be worth looking at how the different peaks relate to the rest of the data.
‘free.sulfur.dioxide’, the concentration of free sulfur dioxide, has a right-skewed distribution: most values are low, with a long tail toward high values. Transforming the x-axis does not change the shape significantly.
The total sulfur dioxide distribution is similar to that of free sulfur dioxide. The two variables may be dependent on each other.
Histograms were plotted with bin sizes adjusted to the maximal resolution allowed by the data. The variables have different distributions. “density” appears to be bell-shaped; “quality” is probably bell-shaped too, but it is hard to tell with so few distinct sensory scores. “residual.sugar”, “chlorides” and “sulphates” are roughly bell-shaped for the majority of values, with a few values far out in the right tail. “fixed.acidity”, “total.sulfur.dioxide”, “alcohol” and “free.sulfur.dioxide” are right-skewed, with most values at the low end and a tail toward high values. “volatile.acidity” and “pH” could be bimodal. “citric.acid” is the most interesting, with no definite shape. There are spikes at 0 and 0.50, which may reflect specifics of the manufacturing process that are unavailable to us for analysis. It would be interesting to find out what accounts for that particular shape of the citric acid distribution and how it relates to the “quality” values.
There are 1599 wines and 12 features: “fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol” and “quality”. The thirteenth column, “X”, is a row index. All variables are numeric or integer types.
The main feature of interest is a sensory assessment of wine quality (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). In this dataset all values fall in the interval between 3 and 8, and most are in the 5-7 range.
All other features are physicochemical measurements, each of which, alone or in combination, can influence taste, so all of them are probably important. Some may be related to each other, e.g. “free.sulfur.dioxide” and “total.sulfur.dioxide”, or “pH” and the amounts of acids.
I created a variable called “quality.category”, which is a categorical representation of the “quality” variable. I believe this makes sense because “quality” is a sensory assessment and there are only a few steps on the scale.
There were some skewed distributions, such as “free.sulfur.dioxide”, “total.sulfur.dioxide” and “alcohol”. “volatile.acidity” may be bimodal. “citric.acid” does not have a definite shape. No operations were needed to tidy or reshape the data.
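A minimal sketch of how such a variable could be derived in R. The data frame name `wine` and the handful of rows are assumptions for illustration, not the real data, and the original analysis may have built the variable differently (for example with `cut()` and custom labels).

```r
# Illustrative stand-in for the wine data; `wine` is an assumed name
wine <- data.frame(quality = c(5L, 5L, 6L, 7L, 3L, 8L, 4L, 6L))

# Treat each score as an ordered level; listing 3:8 keeps all six
# observed scores as levels even if a subset lacks some of them
wine$quality.category <- factor(wine$quality, levels = 3:8, ordered = TRUE)

levels(wine$quality.category)
table(wine$quality.category)
```

An ordered factor keeps the ranking of the scores, so boxplots and facets group by score while still appearing in the natural order.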
The boxplots reveal some noticeable trends. Some features seem to correlate with the quality score. Higher-rated wines have higher alcohol and citric acid levels and lower pH, density and “volatile.acidity” (acetic acid). Lower pH could be associated with higher alcohol levels, as a result of the fermentation process, and also with higher citric acid content. It would be interesting to see which of those features is the most important for the sensory score. The remaining features, such as “chlorides”, “free.sulfur.dioxide” and “total.sulfur.dioxide”, may not correlate with “quality”. Interestingly, a relatively large number of outliers seem to be associated with quality scores 5, 6 and 7 for a number of features, such as “chlorides” and “sulphates”. It looks as if mid-range quality wines vary more in non-essential features.
## [1] 0.831146
## [1] 0.6516203
The ‘fixed.acidity’ means and medians, stratified by quality, correlate with the quality scores, which confirms the trend observed in the boxplots.
## [1] 0.9831221
## [1] 0.9724308
The ‘citric.acid’ means and medians, stratified by quality, also correlate strongly with the quality scores, which confirms the trend observed in the boxplots.
There is no conclusive difference between the two peaks.
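The stratified correlation described above can be sketched as follows. The data frame `wine` and its values are made up for illustration (they are not rows from the dataset), and the use of `aggregate()` is an assumption about how the per-score medians might be computed.

```r
# Illustrative rows only; `wine` and its values are assumptions
wine <- data.frame(
  quality     = c(3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8),
  citric.acid = c(0.00, 0.10, 0.05, 0.15, 0.20, 0.24,
                  0.22, 0.30, 0.35, 0.45, 0.40, 0.50)
)

# One median per quality score
by_quality <- aggregate(citric.acid ~ quality, data = wine, FUN = median)

# Pearson correlation between the six scores and the six medians
r_medians <- cor(by_quality$quality, by_quality$citric.acid)

# The same statistic on the raw rows, for comparison
r_raw <- cor(wine$quality, wine$citric.acid)

c(medians = r_medians, raw = r_raw)
```

Note that a correlation computed on group medians describes how the centre of each group moves with the score. It is typically larger than the correlation over individual wines, because the spread within each group has been averaged away.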
Scatter plots of quality scores vs features such as ‘alcohol’, ‘citric.acid’ or ‘volatile.acidity’, for which the boxplots showed many more outliers at scores 5, 6 and 7, reveal that many more values fall in these categories overall. Very few values fall into quality categories 3, 4 and 8, because most wines are of mid-range quality. This probably explains in part the observation in the boxplot section above that mid-range wines have more outliers: there are simply many more values in those categories. I wonder whether the trend would persist if there were comparable numbers of wines in each quality category.
Histograms faceted by quality score provide essentially the same information as the boxplots and scatter plots.
Some of the features are related and some are not.
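One rough way to probe that question is a simulation: draw groups of different sizes from the same distribution and count how many points a boxplot would flag in each. The sketch below assumes normally distributed values, which the real features are not, and borrows the group sizes 10 and 681 from the counts of scores 3 and 5 reported in this analysis.

```r
set.seed(1)

# Average number of points flagged as boxplot outliers (beyond 1.5 * IQR
# from the hinges) across many simulated groups of size n
mean_outliers <- function(n, reps = 200) {
  mean(replicate(reps, length(boxplot.stats(rnorm(n))$out)))
}

# Same distribution, very different group sizes
small_group <- mean_outliers(10)
large_group <- mean_outliers(681)

c(small = small_group, large = large_group)
```

Even though both groups come from the same distribution, the larger group shows more flagged points on average, which is consistent with group size alone accounting for part of the pattern.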
## [1] -0.2561309
Not much correlation here. This is understandable: fixed.acidity measures tartaric acid and volatile.acidity measures acetic acid, which are different compounds.
## [1] -0.6829782
## [1] -0.7063602
This is a rather strong correlation, and it is even stronger after a log10 transformation. It is somewhat expected, since pH is a measure of acidity in solution: the lower the pH (the higher the acidity), the higher the concentration of tartaric acid, which explains the negative correlation.
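Because pH is itself a logarithmic measure (the negative log10 of the hydrogen ion activity), a log10 transform of the acid concentration can make the relationship more linear. The sketch below illustrates this with made-up concentrations and an idealised pH formula; neither is taken from the dataset.

```r
# Hypothetical acid concentrations, chosen only for illustration
fixed.acidity <- c(5, 6, 7, 8, 10, 12, 15)

# Idealised assumption: pH falls linearly with log10 of the concentration
pH <- 4 - log10(fixed.acidity)

r_raw <- cor(fixed.acidity, pH)          # concentration on a linear scale
r_log <- cor(log10(fixed.acidity), pH)   # concentration on a log10 scale

c(raw = r_raw, log10 = r_log)
```

In this idealised case the log-scale correlation is exactly -1 while the linear-scale one is weaker. Real wines contain several acids and buffers, so the improvement seen in the data is much smaller.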
## [1] 0.2349373
There is some correlation between pH and acetic acid, but not a very strong one. For some reason the correlation is positive. Other factors may be involved.
## [1] 0.2056325
Weak correlation between pH and alcohol. The reason could be that most CO2 is being eliminated during fermentation.
## [1] 0.1099032
No relation between alcohol and citric acid.
## [1] -0.08565242
No correlation here.
## [1] 0.04207544
I would have expected a relation between the sugar that remains after fermentation and the alcohol, because sugar is the substrate for fermentation, but apparently there is none here.
## [1] 0.6676665
Strong correlation here which is expected.
## [1] 0.3552834
Moderate correlation, which is expected.
## [1] -0.5419041
Substantial negative correlation, which is expected.
## [1] 0.6717034
Interestingly, there is quite a strong correlation between tartaric acid and citric acid concentrations.
The boxplots reveal some noticeable trends. Some features seem to correlate with the quality score, which is the feature of interest in this dataset. Higher-rated wines have higher alcohol, tartaric acid and citric acid levels and lower pH, density and “volatile.acidity” (acetic acid). For instance, r=0.83 for the correlation between the fixed.acidity medians in each quality category and the quality score, and the correlation is even stronger (r=0.98) for the citric.acid medians. It would be interesting to see which of those features is the most important for the sensory score. The remaining features, such as “chlorides”, “free.sulfur.dioxide” and “total.sulfur.dioxide”, may not correlate with “quality”. A relatively large number of outliers seem to be associated with quality scores 5, 6 and 7 for a number of features, such as “chlorides” and “sulphates”. The scatter plots show the same thing. Scatter plots of quality scores vs features such as “alcohol”, “citric.acid” or “volatile.acidity” reveal that many more values fall in these categories overall. Very few values fall into quality categories 3, 4 and 8, because most wines are of mid-range quality. This probably explains in part the observation in the boxplot section above that mid-range wines have more outliers: there are simply many more values in those categories. I wonder whether the trend would persist if there were comparable numbers of wines in each quality category.
I would have expected a relation between the sugar that remains after fermentation and the alcohol level, because sugar is the substrate for fermentation. For instance, if fermentation starts with the same initial amount of sugar in each wine, then more alcohol would mean less remaining sugar, but apparently there is no such relation here. It could be that the initial sugar amounts differ widely between wines, or that the wines are blends of different varieties combined after fermentation ended.
There is some correlation between pH and acetic acid, but not a very strong one. For some reason the correlation is positive, where one would expect it to be negative. Lower pH should go with higher amounts of acid, which is the case here for tartaric acid but not for acetic acid. Apparently the relationship is more complex, and other factors may be involved.
There is a strong relationship between tartaric acid and pH. Apparently tartaric acid has a significant influence on pH.
Interestingly, there is quite a strong relationship between tartaric acid and citric acid concentrations. Tartaric acid (fixed.acidity) and citric acid concentrations are distributed quite differently, but are strongly correlated with each other.
There is a rather strong correlation (r = -0.68), even stronger with a log10 transformation (r = -0.71), between pH and tartaric acid concentration. It is somewhat expected, since pH is a measure of acidity in solution: the lower the pH (the higher the acidity), the higher the concentration of tartaric acid, which explains the negative correlation. Interestingly, there is quite a strong correlation (r = 0.67) between tartaric acid and citric acid concentrations. There was also a strong relationship between free and total sulfur dioxide concentrations (r = 0.67), which is expected too.
Most higher-quality wines are associated with higher citric acid levels and lower pH, in the upper left corner of the plot.
Most higher-quality wines are associated with higher tartaric acid levels and lower pH, in the upper left corner of the plot.
The same conclusion as from the scatter plot above can be drawn by analysing the density plot of the pH/fixed.acidity ratio.
Most higher-quality wines are associated with higher tartaric and citric acid levels, in the upper right corner of the plot.
Higher quality is associated with higher alcohol and lower pH.
The same conclusion as from the scatter plot above can be drawn by analysing the density plot of the alcohol/pH ratio.
Wines of average quality (5-6) seem to sit mostly in the middle of the density and pH ranges, while higher-quality wines are spread throughout the range from high density/lower pH to low density/high pH.
According to the density plot, there may be a trend for higher-quality wines to fall in the lower pH/density ratio area, which corresponds to the higher density/lower pH region of the scatter plot above. This was not obvious from the boxplots of ‘pH’ and ‘density’ vs ‘quality’ or from the scatter plot of ‘pH’ over ‘density’.
Most higher-quality wines were associated with higher citric acid levels and lower pH. The same was true for the relationship between tartaric acid and pH. Higher quality was also related to higher tartaric and citric acid levels, and to higher alcohol and lower pH.
According to the scatter plot, wines of average quality (5-6) seem to sit mostly in the middle of the density and pH ranges, while higher-quality wines are spread throughout the range from high density/lower pH to low density/high pH. The density plot reveals that there may be a trend for higher-quality wines to fall in the lower pH/density ratio area, which corresponds to the higher density/lower pH region of the scatter plot. This was not obvious from the boxplots of ‘pH’ and ‘density’ vs ‘quality’ or from the scatter plot of ‘pH’ over ‘density’.
The distributions of the two acids are different. Means are indicated by red dashed lines and medians by blue dashed lines.
## [1] 0.2709756
## [1] 0.26
Mean and median are close to each other for citric acid.
## [1] 8.319637
## [1] 7.9
For tartaric acid the mean and median are separated, indicating skewness of the distribution. The distribution of tartaric acid is bell-shaped with some skewness, while the distribution of citric acid is irregular in shape. A substantial number of wines have no citric acid at all.
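To put "close" and "separated" on a common footing, the gap between mean and median can be scaled by each variable's interquartile range. The sketch below uses only the means, medians and quartiles printed in the summary output of this analysis. Scaling by the IQR is an illustrative choice, not part of the original analysis.

```r
# Figures copied from the summary output in this analysis
stats <- data.frame(
  variable = c("citric.acid", "fixed.acidity"),
  mean     = c(0.2709756, 8.319637),
  median   = c(0.26, 7.9),
  q1       = c(0.09, 7.1),
  q3       = c(0.42, 9.2)
)

# Gap between mean and median, expressed as a fraction of the IQR;
# a positive value means the mean is pulled toward the right tail
stats$gap_in_iqr <- (stats$mean - stats$median) / (stats$q3 - stats$q1)

stats[, c("variable", "gap_in_iqr")]
```

On this scale the gap is about 0.03 of an IQR for citric acid and about 0.20 for fixed acidity, which matches the visual impression that only the tartaric acid distribution is noticeably skewed.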
## [1] 0.9831221
Medians of citric acid concentrations in each sensory score category are correlated with the quality scores (r=0.98). The higher the median of the citric acid, the higher the sensory score.
## Source: local data frame [6 x 2]
##
## quality n
## (int) (int)
## 1 3 10
## 2 4 53
## 3 5 681
## 4 6 638
## 5 7 199
## 6 8 18
Many more outliers are located at scores 5, 6 and 7 because many more values fall in these categories in general. Most wines are of mid-range quality.
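The imbalance can be quantified directly from the counts in the table above. A small sketch; the variable names are arbitrary.

```r
# Counts per quality score, copied from the table above
counts <- c(`3` = 10, `4` = 53, `5` = 681, `6` = 638, `7` = 199, `8` = 18)

total <- sum(counts)                       # number of wines
share <- round(100 * counts / total, 1)    # percentage per score

# Fraction of wines scored 5, 6 or 7
mid_range <- sum(counts[c("5", "6", "7")]) / total

share
mid_range
```

Scores 5, 6 and 7 together cover 1518 of the 1599 wines, about 95%, so the extreme categories rest on very few observations.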
## [1] 0.6717034
Tartaric acid concentration is related to citric acid concentration (r=0.67).
## [1] 0.8961178
Even stronger correlation between medians (r=0.90)
Lower tartaric acid corresponds to lower citric acid, and higher tartaric acid corresponds to higher citric acid. Higher sensory scores are mostly found in the upper right quadrant, which indicates an association of “quality” with higher concentrations of both acids. These two variables can be used for model building.
The red wine dataset contains 1599 wines from Portugal. The variables represent the chemical composition of the wines. The purpose of the dataset could be to build a predictive model of the perceived quality of a wine based on its chemical composition. All variables are numeric or integer. The output variable, quality, is based on sensory data and is scored between 0 and 10; in this dataset it takes values from 3 to 8. The other variables are dispersed to different degrees and are recorded to different precision.
I realized that the output variable can be represented both numerically and categorically. Both forms can be used in the analysis, so an extra column with a categorical quality score was added to the dataset. The analysis reveals that some variables are associated with the quality score, such as tartaric and citric acid concentrations, pH, alcohol and density. Other features, such as chloride, residual sugar and sulphate levels, do not seem to be associated with quality.
Most of the wines are of mid-range quality, so low- and high-quality wines are underrepresented in the dataset. Adding more wines to the low and high quality categories would certainly improve the power of the analysis. The chemical composition of wine is very complex, and many more features could be added, such as various inorganic and organic components, which could improve the analysis and increase the predictive power of a prospective model. To compensate in part for the lack of a comprehensive chemical characterization, other features could be included, such as the region of the winery or the age of the wine.